Text Analysis with R for Students of Literature by Matthew L. Jockers

Author: Matthew L. Jockers
Language: eng
Format: epub, pdf
Publisher: Springer International Publishing, Cham


10.2 The Text Encoding Initiative (TEI)

The Text Encoding Initiative (TEI) offers a document-encoding standard that is commonly used by humanities scholars. The TEI markup scheme provides a way of storing an original text alongside a virtually unlimited amount of metadata. Since the files are extensible and editable, the amount of metadata available is limited only by the encoder’s willingness to modify the documents. Say, for example, that you are collecting novels written by Irish-American and German-American authors. For this project you might have a metadata field in your document where you can indicate the author’s national origins. You may have another field where you indicate the author’s gender, or birth date, or race, or sexual orientation. Once metadata of this sort is added to the XML files, it can be easily accessed by computer scripts and used, for example, as a sorting facet for a particular type of analysis.

In the rest of this book, you will be working with a corpus of texts that are encoded in TEI-compliant XML. Unlike the plain text files (Moby Dick and Sense and Sensibility) that you have processed thus far, these TEI-XML files contain extra-textual information in the metadata of the <teiHeader> element. To proceed, you must be able to parse the XML and extract the metadata while also separating the actual text of the book from the marked-up apparatus around it. In short, you need to know how to parse XML in R.
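As a rough illustration of that task, the following sketch parses a tiny TEI document and pulls the metadata and the text apart. It assumes the XML package is installed, and it uses a made-up, heavily simplified TEI string (not schema-valid, and not one of the corpus files). The object names (tei.text, tei.ns, title.v, and so on) are hypothetical and chosen only for this example.

```r
# Assumption: the XML package is available (install.packages("XML")).
library(XML)

# A made-up, heavily simplified TEI document, for illustration only.
tei.text <- '<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>A Hypothetical Novel</title>
        <author>Jane Example</author>
      </titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>First paragraph of the novel.</p>
      <p>Second paragraph of the novel.</p>
    </body>
  </text>
</TEI>'

# Parse the string into an internal XML document.
doc <- xmlParse(tei.text, asText = TRUE)

# TEI elements live in a namespace, so XPath queries need a prefix for it.
tei.ns <- c(tei = "http://www.tei-c.org/ns/1.0")

# Metadata: values stored inside the <teiHeader> element.
title.v <- xpathSApply(doc, "/tei:TEI/tei:teiHeader//tei:title",
                       xmlValue, namespaces = tei.ns)
author.v <- xpathSApply(doc, "/tei:TEI/tei:teiHeader//tei:author",
                        xmlValue, namespaces = tei.ns)

# Text: every paragraph inside the <body>, without the surrounding markup.
paras.v <- xpathSApply(doc, "/tei:TEI/tei:text/tei:body//tei:p",
                       xmlValue, namespaces = tei.ns)

# Collapse the paragraphs into one string, ready for tokenization.
novel.v <- paste(paras.v, collapse = " ")
```

The XPath expressions keep the two concerns separate: one set of queries reads only from the header, the other only from the body. For a file on disk, the same calls should work with a file path passed to xmlParse in place of the string and the asText argument omitted.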


